1 Introduction

One of the most successful recent signal processing paradigms has been the sparse
coding/dictionary design model [HIE]. In this model, we try to represent a given
$d \times n$ data matrix $X$, whose $n$ columns are points in $\mathbb{R}^d$, via a solution to the
problem

$$\{W_*, Z_*\} = \{W_*(K, X, q), Z_*(K, X, q)\} = \arg\min_{Z \in \mathbb{R}^{K \times n},\, W \in \mathbb{R}^{d \times K}} \sum_k \|W z_k - x_k\|^2, \quad \|z_k\|_0 \le q, \tag{1.1}$$

or its Z coordinate convexification 

$$\{W_*, Z_*\} = \{W_*(K, X, \lambda), Z_*(K, X, \lambda)\} = \arg\min \sum_k \|W z_k - x_k\|^2 + \lambda \|z_k\|_1. \tag{1.2}$$

Here, $\{W, Z\}$ are the dictionary and the coefficients, respectively, and $z_k$ is the
$k$th column of $Z$. $K$, $q$, and $\lambda$ are user-selected parameters controlling the power
of the model.
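
As a small illustration of the two objectives, the following sketch evaluates the reconstruction term shared by (1.1) and (1.2), checks the sparsity constraint of (1.1), and adds the $\ell_1$ penalty of (1.2). The function names and the use of NumPy are assumptions made for this example only; nothing here solves the minimization itself.

```python
import numpy as np

def reconstruction_error(W, Z, X):
    """sum_k ||W z_k - x_k||^2, the data term of both (1.1) and (1.2)."""
    return float(np.sum((W @ Z - X) ** 2))

def is_q_sparse(Z, q):
    """True if every column z_k of Z satisfies ||z_k||_0 <= q, as in (1.1)."""
    return bool(np.all(np.count_nonzero(Z, axis=0) <= q))

def l1_energy(W, Z, X, lam):
    """Energy of (1.2): data term plus lam * sum_k ||z_k||_1."""
    return reconstruction_error(W, Z, X) + lam * float(np.sum(np.abs(Z)))
```

Here `W` is $d \times K$, `Z` is $K \times n$, and `X` is $d \times n$, matching the shapes used in the text.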

More recently, many models with additional structure have been proposed.
For example, in [5], [2], the dictionary elements are arranged in groups and the
sparsity is on the group level. In [3 [3 [7], the dictionaries are constructed to be
translation invariant. In the former work, the dictionary is constructed via a
non-negative matrix factorization. In the latter two works, the construction is
a convolutional analogue of (1.2) or an $\ell_p$ variant, with $0 < p < 1$. In this short
note we work with greedy algorithms for solving the convolutional analogues
of (1.1). Specifically, we demonstrate that sparse coding by matching pursuit
and dictionary learning via K-SVD pQ can be used in the translation invariant
setting.

2 Matching Pursuit 

Matching pursuit [6] is a greedy algorithm for the solution of the sparse coding 
problem 

$$\min_z \|Wz - x\|^2, \quad \|z\|_0 \le q,$$

where the $d \times k$ matrix $W$ is the dictionary, the $k \times 1$ vector $z$ is the code, and $x$ is a
$d \times 1$ data vector.



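1. Set $e = x$, and let $z$ be the $k$-dimensional zero vector.

2. Find $j = \arg\max_i |W_i^T e|$, where $W_i$ is the $i$th column of $W$.

3. Set $a = W_j^T e$.

4. Set $e \leftarrow e - a W_j$, and $z_j \leftarrow z_j + a$.

5. Repeat steps 2 to 4 until $q$ steps have been taken.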
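
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: it assumes the columns of `W` have unit Euclidean norm, takes the coefficient at each step from the current residual, and the function name `matching_pursuit` is chosen here for the example.

```python
import numpy as np

def matching_pursuit(W, x, q):
    """Run q greedy matching pursuit steps for x with dictionary W (d x k).

    Assumes every column of W has unit norm. Returns the code z and the
    final residual e = x - W z.
    """
    e = np.array(x, dtype=float)          # step 1: residual starts as x
    z = np.zeros(W.shape[1])
    for _ in range(q):
        corr = W.T @ e                    # inner products with the residual
        j = int(np.argmax(np.abs(corr)))  # step 2: best matching atom
        a = corr[j]                       # step 3: its coefficient
        e -= a * W[:, j]                  # step 4: update the residual
        z[j] += a                         #         and the code
    return z, e
```

Note that this direct version multiplies `W.T` against the residual at every step, which is exactly the cost that the bookkeeping discussed next avoids.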

Note that with a bit of bookkeeping, it is only necessary to multiply $W$
against $x$ once, instead of $q$ times. This comes at a cost of an extra $O(K^2)$ storage:
let $e_r$ and $a_r$ be $e$ and $a$ from the $r$th step above, and let $j_r$ be the index chosen at that step. Then

$$W^T e_0 = W^T x; \qquad W^T e_1 = W^T x - a_0 W^T W_{j_0},$$

and so on. If the Gram matrix $W^T W$ is stored, this is just a lookup.
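
A sketch of this bookkeeping, under the same unit-norm assumption on the columns of `W`: the vector of correlations $W^T e_r$ is kept up to date by subtracting a scaled column of the Gram matrix, so the data vector is touched only once. The function name is hypothetical.

```python
import numpy as np

def matching_pursuit_gram(W, x, q):
    """Matching pursuit that multiplies W against x only once.

    Assumes unit-norm columns of W. The k x k Gram matrix is the extra
    storage; each step is then a column lookup and a vector update.
    """
    G = W.T @ W                      # Gram matrix, computed once
    corr = W.T @ x                   # W^T e_0 = W^T x
    z = np.zeros(W.shape[1])
    for _ in range(q):
        j = int(np.argmax(np.abs(corr)))
        a = corr[j]
        z[j] += a
        corr = corr - a * G[:, j]    # W^T e_{r+1} = W^T e_r - a_r W^T W_{j_r}
    return z
```

The residual itself is never formed; if it is needed afterwards it can be recovered as `x - W @ z`.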

2.1 Convolutional MP 

We consider the special case

$$\min_z \Big\| \sum_{j=1}^{k} W_j * z_j - x \Big\|^2, \quad \|z\|_0 \le q,$$

where each $W_j$ is a filter, and $z$ is the collection of all of the responses $z_j$.

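Note that the Gram matrix of the "Toeplitz" dictionary consisting of all the
shifts of the $W_j$ is usually too big to be used as a lookup table. However, because
of the symmetries of the convolution, it is also unnecessary: the inner product of
two shifted filters depends only on their relative shift, and two filters overlap for
at most $(2h_f - 1)(2w_f - 1)$ relative shifts. We therefore only need to store
a $4 \cdot h_f \times w_f \times k^2$ array of inner products, where $h_f$ and $w_f$ are the dimensions
of the filters.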

With this additional storage, running $q$ matching pursuit steps with $k$ filters on
an $h \times w$ image costs one application of the filter bank plus
$O(kqhw)$ operations.
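
The following is a sketch of the convolutional procedure under several assumptions that are not taken from the text: filters have unit norm, a filter is placed only where it fits entirely inside the image, placement is correlation-style (no filter flipping), and all function names are hypothetical. The table `G` holds the inner products of all filter pairs at all overlapping relative shifts, and the correlations are updated only in the neighbourhood of each chosen activation.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def filter_gram(filters):
    """G[i, j, dy + hf - 1, dx + wf - 1] = <w_i, w_j shifted by (dy, dx)>."""
    k, hf, wf = filters.shape
    padded = np.pad(filters, ((0, 0), (hf - 1, hf - 1), (wf - 1, wf - 1)))
    G = np.zeros((k, k, 2 * hf - 1, 2 * wf - 1))
    for dy in range(-(hf - 1), hf):
        for dx in range(-(wf - 1), wf):
            # shifted[j, u, v] = w_j[u - dy, v - dx], zero outside the filter
            shifted = padded[:, hf - 1 - dy:2 * hf - 1 - dy,
                             wf - 1 - dx:2 * wf - 1 - dx]
            G[:, :, dy + hf - 1, dx + wf - 1] = np.einsum(
                "iuv,juv->ij", filters, shifted)
    return G

def apply_filter_bank(filters, image):
    """corr[j, y, x] = <w_j, image patch whose top-left corner is (y, x)>."""
    windows = sliding_window_view(image, filters.shape[1:])
    return np.einsum("yxuv,juv->jyx", windows, filters)

def conv_matching_pursuit(filters, image, q):
    """q greedy steps with k unit-norm filters of shape (k, hf, wf)."""
    k, hf, wf = filters.shape
    G = filter_gram(filters)
    corr = apply_filter_bank(filters, image)   # the only pass over the data
    z = np.zeros_like(corr)
    _, hc, wc = corr.shape
    for _ in range(q):
        j, y, x = np.unravel_index(np.argmax(np.abs(corr)), corr.shape)
        a = corr[j, y, x]
        z[j, y, x] += a
        # Only placements overlapping the chosen one change; look them up.
        y0, y1 = max(0, y - hf + 1), min(hc, y + hf)
        x0, x1 = max(0, x - wf + 1), min(wc, x + wf)
        block = G[:, j, y - y1 + hf:y - y0 + hf, x - x1 + wf:x - x0 + wf]
        corr[:, y0:y1, x0:x1] -= a * block[:, ::-1, ::-1]
    return z, corr

def reconstruct(filters, z, shape):
    """Sum of the activated filters, each placed at its activation point."""
    _, hf, wf = filters.shape
    out = np.zeros(shape)
    for j, y, x in zip(*np.nonzero(z)):
        out[y:y + hf, x:x + wf] += z[j, y, x] * filters[j]
    return out
```

In this sketch each step costs one scan of the $k$ correlation maps to find the maximum plus a local update of size at most $k (2h_f - 1)(2w_f - 1)$, in line with the operation count quoted above.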

3 Learning the filters 

Given a set of x, we can learn the filters and the codes simultaneously. Several 
methods are available. A simple one is to alternate between updating the codes 
and updating the filters, as in K-SVD pQ: 

1. Initialize $k$ filters $\{w_1, \ldots, w_k\}$ of size $h_f \times w_f$.

2. Solve for z as above. 
3. For each filter $w_j$,

• find all locations in all the data images where $w_j$ is activated;

• extract the $h_f \times w_f$ patch $E_p$ from the reconstruction via $z$ at each
activated point $p$;

• remove the contribution of $w_j$ from each $E_p$ (i.e. $E_p \leftarrow E_p - c_{p,j} w_j$,
where $c_{p,j}$ was the activation determined by $z$);

• update $w_j \leftarrow \mathrm{PCA}(E_p)$.

4. Repeat from step 2 for a fixed number of iterations.
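
One possible reading of step 3, sketched for a single image and a single filter. The interpretation is an assumption and is not confirmed by the text: following the usual K-SVD convention, each $E_p$ is taken here to be the patch of the reconstruction residual at $p$ with the contribution of $w_j$ at $p$ added back, and the PCA step is taken to be the leading right singular vector of the stacked patches, without mean centering. The name `update_filter` is hypothetical.

```python
import numpy as np

def update_filter(filters, z, image, j):
    """K-SVD-style update of filter j from its activations in one image.

    filters: (k, hf, wf); z: (k, h - hf + 1, w - wf + 1) activations, with
    z[i, y, x] placing filter i with its top-left corner at (y, x).
    Returns the new unit-norm filter (sign is arbitrary).
    """
    k, hf, wf = filters.shape
    recon = np.zeros_like(image, dtype=float)
    for i, y, x in zip(*np.nonzero(z)):
        recon[y:y + hf, x:x + wf] += z[i, y, x] * filters[i]
    residual = image - recon
    ys, xs = np.nonzero(z[j])
    if len(ys) == 0:
        return filters[j]            # filter never activated: leave it as is
    # E_p: residual patch at p with the contribution of w_j at p restored.
    patches = np.stack([
        (residual[y:y + hf, x:x + wf] + z[j, y, x] * filters[j]).ravel()
        for y, x in zip(ys, xs)])
    # Leading right singular vector = first (uncentered) principal direction.
    _, _, vt = np.linalg.svd(patches, full_matrices=False)
    return vt[0].reshape(hf, wf)
```

When activations of the same filter overlap, this sketch restores only the contribution at $p$ itself, so the patches are an approximation in that case. The activations would normally be recomputed in step 2 after the filters change.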

We note that the forward subproblem (finding $Z$ with $W$ fixed) is not convex,
and so the alternation is not guaranteed to decrease the energy or to converge
to even a local minimum. However, in practice, on image and audio data, this
method generates good filters.

4 Some experiments 

We train filters on three data sets: the AT&T face database, the motorcycles 
from a Caltech database, and the VOC PASCAL database. For all the images in 
all our experiments, we perform an additive contrast normalization: each image
$x$ is transformed into $x' = x - x * b$, where $b$ is a $5 \times 5$ averaging box filter. This
is very nearly the transform $x' = \nabla^2 x$ (up to sign and scale), that is, using the discrete Laplacian of
the image instead of the image. Using the Laplacian would correspond to using
the energy 



that is, the energy sees the difference between gradients, not intensities. 
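
The normalization $x' = x - x * b$ can be sketched as follows. The treatment of the image border is not specified in the text; replicating the edge pixels is an assumption made here, as is the function name.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def contrast_normalize(image, size=5):
    """Subtract from each pixel the mean of its size x size neighbourhood.

    Borders are handled by replicating edge pixels (an assumption).
    """
    image = np.asarray(image, dtype=float)
    pad = size // 2
    padded = np.pad(image, pad, mode="edge")
    # One size x size window per pixel of the original image.
    local_mean = sliding_window_view(padded, (size, size)).mean(axis=(-2, -1))
    return image - local_mean
```

Constant images and, away from the border, linear ramps are mapped to zero, which is the behaviour one expects from a Laplacian-like filter.
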
4.1 Faces 

The AT&T face database, available at
http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase
is a set of 400 images of 40 individuals. The faces are centered in each image.
We resize each image to $64 \times 64$ and contrast normalize. We train 8 filters of size
$16 \times 16$. After training the filters we find the feature maps of each image in the
database, obtaining a new set of 400 8-channel images. We take the elementwise
absolute value of each of the 8-channel images, and then average pool over $8 \times 8$
blocks. We then train a new 16-element dictionary on the subsampled images.
In Figure 1 we display the first level filters, and the second level filters up to
shifts of size 8 and sign changes of the first level filters.
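
The rectification and pooling between the two levels can be sketched as below. Non-overlapping pooling blocks, and dropping trailing rows and columns that do not fill a whole block, are assumptions of this example; the function name is hypothetical.

```python
import numpy as np

def rectify_and_pool(feature_maps, block=8):
    """Elementwise absolute value, then block x block average pooling.

    feature_maps has shape (channels, height, width). Blocks do not
    overlap; rows and columns beyond the last full block are dropped.
    """
    c, h, w = feature_maps.shape
    hb, wb = h // block, w // block
    rect = np.abs(feature_maps[:, :hb * block, :wb * block])
    # Split each spatial axis into (number of blocks, block size) and average.
    return rect.reshape(c, hb, block, wb, block).mean(axis=(2, 4))
```

For an 8-channel $64 \times 64$ response this yields an 8-channel $8 \times 8$ image on which the second level dictionary can be trained.
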
Figure 1: First and second layer filters from faces




Figure 2: A contrast normalized face, and its reconstruction from 40 filter responses.



4.2 Caltech motorcycles 

We also train on the motorbikes-side dataset, available at http://www.vision.caltech.edu/html-files/arcl
which consists of color images of various motorcycles. The motorcycles are centered
in each image. We convert each image to gray level, resize to $64 \times 64$, and
contrast normalize. We train 8 filters of size $16 \times 16$. As before, we then train a new
16-element dictionary on the subsampled absolute value rectified responses of
the first level. In Figure 3 we display the first level filters, and the second level
filters up to shifts of size 8 and sign changes of the first level filters.
Figure 3: First and second layer filters from motorcycles

Figure 4: A contrast normalized motorcycle, and its reconstruction from 40 filter responses.

Figure 5: First and second layer filters from natural images



4.3 Images from PASCAL VOC 

We also show results trained on "unclassified" natural images from the PASCAL
visual object challenge dataset, available at http://pascallin.ecs.soton.ac.uk/challenges/VOC/
We randomly subsample 5000 grayscale images by a factor between 1 and 4, then
pick from each image a $64 \times 64$ patch, and then contrast normalize. We train 8
filters of size $8 \times 8$. We then train a new 64-element dictionary of $4 \times 4$ filters on the
subsampled absolute value rectified responses of the first level. In Figure 5 we display
the first level filters, and the second level filters up to shifts of size 8 and sign
changes of the first level filters.

In order to show the dependence of the filters on the number of filters used,
in Figure 6 we display 8-, 16-, and 64-element dictionaries of $16 \times 16$ filters trained on the
same set as above.
Figure 6: Dictionaries with varying numbers of elements trained on natural images.